[fix](rpc) Fix AutoReleaseClosure data race with callback reuse by zclllyybb · Pull Request #61782 · apache/doris

zclllyybb · 2026-03-26T15:51:45Z

The callback's call() method may reuse the callback object (e.g., in vdata_stream_sender.h get_send_callback()), triggering a new RPC that mutates response_ and cntl_. If AutoReleaseClosure::Run() invokes call() before checking cntl_->Failed() or response_->status(), it reads the NEW RPC's state instead of the ORIGINAL RPC's result, causing:

*** SIGSEGV address not mapped to object (@0x0) received by PID 238162 (TID 240463 OR 0xfffa2c9898e0) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /opt/module/doris/java8/jre/lib/aarch64/server/libjvm.so
 2# JVM_handle_linux_signal in /opt/module/doris/java8/jre/lib/aarch64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /opt/module/doris/java8/jre/lib/aarch64/server/libjvm.so
 4# 0x0000FFFF0AB107C0 in linux-vdso.so.1
 5# doris::Status doris::Status::create<true>(doris::PStatus const&) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:398
 6# void doris::AutoReleaseClosure<doris::PTransmitDataParams, doris::pipeline::ExchangeSendCallback<doris::PTransmitDataResult> >::_process_status<doris::PTransmitDataResult>(doris::PTransmitDataResult*) at /home/zcp/repo_center/doris_release/doris/be/src/util/ref_count_closure.h:128
 7# doris::AutoReleaseClosure<doris::PTransmitDataParams, doris::pipeline::ExchangeSendCallback<doris::PTransmitDataResult> >::Run() at /home/zcp/repo_center/doris_release/doris/be/src/util/ref_count_closure.h:102
 8# brpc::Controller::EndRPC(brpc::Controller::CompletionInfo const&) in /opt/module/doris/be/lib/doris_be
 9# brpc::policy::ProcessRpcResponse(brpc::InputMessageBase*) in /opt/module/doris/be/lib/doris_be
10# brpc::ProcessInputMessage(void*) in /opt/module/doris/be/lib/doris_be
11# bthread::TaskGroup::task_runner(long) in /opt/module/doris/be/lib/doris_be
12# bthread_make_fcontext in /opt/module/doris/be/lib/doris_be

We confirmed that the data race is real by adding temporary LOG checks, which have since been removed:

F20260325 21:46:58.465230 3453395 brpc_closure.h:116] Check failed: _debug_generation_at_construction == current_gen (2 vs. 3) RACE DETECTED: AutoReleaseClosure response_ was reused by a new RPC (generation changed from 2 to 3) while still in Run(). The old closure is about to read response_->status() but the new RPC may be concurrently writing to the same response_ object.
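
The ordering problem described above can be illustrated with a minimal single-threaded sketch. It is not the Doris code: `FakeRpcState`, `ReusableCallback`, and the two `run_*` helpers are hypothetical stand-ins, and the callback reuse is simulated by resetting the shared state inside `call()`. In the real system the new RPC writes that state concurrently, which is what turns the stale read into a data race.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the state behind cntl_ and response_.
struct FakeRpcState {
    bool failed = false; // plays the role of cntl_->Failed()
};

// Hypothetical reusable callback: call() starts a "new RPC" on the same
// object, which resets the shared state.
struct ReusableCallback {
    std::shared_ptr<FakeRpcState> state = std::make_shared<FakeRpcState>();
    void call() { *state = FakeRpcState {}; }
};

// Old ordering: call() first, then inspect the state.
inline bool run_call_then_check(ReusableCallback& cb) {
    cb.call();
    return cb.state->failed; // observes the NEW RPC's state
}

// Patched ordering: inspect the state first, call() last.
inline bool run_check_then_call(ReusableCallback& cb) {
    const bool failed = cb.state->failed; // observes the ORIGINAL result
    cb.call();
    return failed;
}

// Usage: the original RPC failed in both cases.
inline void check_orderings() {
    ReusableCallback old_order;
    old_order.state->failed = true;
    assert(!run_call_then_check(old_order)); // the failure is lost

    ReusableCallback new_order;
    new_order.state->failed = true;
    assert(run_check_then_call(new_order)); // the failure is seen
}
```

With the old ordering the closure never sees the original failure; with the patched ordering it does, and `call()` is free to reuse the callback afterwards.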

We also added BE unit tests that pass only with this patch.
Before the fix:

[----------] 7 tests from ExchangeSinkTest (5 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (5 ms total)
[  PASSED  ] 4 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] ExchangeSinkTest.test_closure_call_must_not_corrupt_status_check
[  FAILED  ] ExchangeSinkTest.test_closure_call_must_not_hide_error_status
[  FAILED  ] ExchangeSinkTest.test_closure_call_must_not_hide_rpc_failure

After the fix:

[----------] 7 tests from ExchangeSinkTest (4 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (5 ms total)
[  PASSED  ] 7 tests.

zclllyybb · 2026-03-26T15:51:54Z

/review

hello-stephen · 2026-03-26T15:51:57Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

zclllyybb · 2026-03-26T15:51:58Z

run buildall

Copilot

Pull request overview

Fixes a brpc AutoReleaseClosure race when callbacks are reused to start a new RPC (mutating response_/cntl_), by ensuring status/failure checks happen before invoking callback->call(), and adds unit tests to prevent regressions.

Changes:

Reorders AutoReleaseClosure::Run() so it checks cntl_ / response_->status() before invoking callback->call().
Refactors runtime-filter sync-size RPC handling to keep the callback alive until RPC completion.
Adds BE unit tests that simulate callback reuse mutating shared RPC state.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

File	Description
be/test/exec/exchange/exchange_sink_test.cpp	Adds tests that simulate callback reuse and verify correct ordering.
be/src/util/brpc_closure.h	Reorders `Run()` and simplifies error handling paths.
be/src/exec/runtime_filter/runtime_filter_producer.h	Adds storage to keep sync-size callback alive across async RPC.
be/src/exec/runtime_filter/runtime_filter_producer.cpp	Introduces `SyncSizeCallback` and changes closure construction/lifetime.
be/src/exec/runtime_filter/runtime_filter_mgr.h	Updates `_send_rf_to_target` signature (removes QueryContext arg).
be/src/exec/runtime_filter/runtime_filter_mgr.cpp	Updates runtime-filter RPC closure construction (drops ctx passing).
be/src/exec/runtime_filter/runtime_filter.cpp	Updates runtime-filter RPC closure construction (drops ctx passing).
be/src/exec/operator/exchange_sink_buffer.cpp	Adds comments documenting callback reuse ordering constraints.
be/src/exec/exchange/vdata_stream_sender.h	Documents callback reuse rationale.

Comments suppressed due to low confidence (1)

be/src/util/brpc_closure.h:1

This change removes the previous QueryContext-based failure propagation (and error_msg augmentation) from AutoReleaseClosure, replacing it with logging only. That’s a behavior change: some call sites previously relied on AutoReleaseClosure to cancel the query on RPC failure / non-OK status(), gated by ignore_runtime_filter_error; after this patch, errors may no longer interrupt the query and could lead to hangs or delayed failure handling. Recommendation (mandatory): restore a mechanism to propagate failures (either by reintroducing the optional QueryContext + captured status/failure info before calling tmp->call(), or by requiring callers to pass a callback that performs the cancel/sub/error handling) while keeping the call() invocation as the last step to avoid the reuse race.

be/src/exec/runtime_filter/runtime_filter_producer.cpp

be/src/exec/runtime_filter/runtime_filter_mgr.cpp

be/test/exec/exchange/exchange_sink_test.cpp

be/src/exec/exchange/vdata_stream_sender.h

github-actions

Code Review Summary

PR Goal & Accomplishment

This PR fixes a real data race in AutoReleaseClosure::Run() where call() was invoked before checking cntl_->Failed() and response_->status(). Because call() can trigger callback reuse (e.g., in ExchangeSendCallback::call() -> _suc_fn -> _send_rpc() which reuses the same callback via get_send_callback()), the closure would then read mutated cntl_/response_ state from the new RPC instead of the original. The fix correctly reorders: log/check first, then call() last. The crash stacktrace and debug-log evidence clearly confirm the race.

The secondary change — removing QueryContext cancellation from AutoReleaseClosure and moving error handling into individual callbacks — is also sound. All callback types that need error handling (ExchangeSendCallback, SyncSizeCallback, WriteBlockCallback) already fully handle errors in their call() methods. The DummyBrpcCallback paths either have caller-side join()+check patterns or are fire-and-forget operations where query cancellation was too aggressive.

Compilation Bug Found

The old code had two _process_status overloads: an unconstrained no-op fallback template <typename Response> void _process_status(Response*) {} and a constrained template <HasStatus Response> void _process_status(Response*). The new code only has the constrained _log_error_status<HasStatus Response> but removed the unconstrained fallback. This will cause a compilation failure for PTabletWriterCancelResult (an empty protobuf message with no status() field), used in vtablet_writer.cpp:1229.

Critical Checkpoint Conclusions

Goal accomplished? Yes, the core data race fix is correct. Tests prove the ordering. However, there is a compilation bug (see inline comment).
Modification minimal and focused? Yes. The changes are well-scoped: core fix in brpc_closure.h, consequent cleanup of QueryContext removal, and SyncSizeClosure -> SyncSizeCallback refactor.
Concurrency: The core fix addresses the concurrency issue correctly by ensuring call() (which may trigger callback reuse and new RPCs) is the last operation in Run(). SyncSizeCallback::call() reads cntl_ and response_, which is safe because call() now runs before any reuse can happen.
Lifecycle management: _sync_size_callback (as shared_ptr<void>) correctly keeps the SyncSizeCallback alive until the RuntimeFilterProducer is destroyed. Without this, AutoReleaseClosure's weak_ptr to the callback would expire when send_size() returns, and the error-path sub() would never fire. This is correct.
Parallel code paths: The two success handler lambdas in exchange_sink_buffer.cpp (unicast and broadcast) both have the same comment added about _send_rpc being last. Good.
Test coverage: Three well-designed unit tests verify the ordering invariant by capturing state at call() time and mutating afterward. Tests correctly fail with the old ordering and pass with the new.
Error handling: The behavioral change from QueryContext::cancel() to LOG(WARNING) is safe — all critical paths already handle errors in their callbacks or via caller-side join()+check.
No incompatible changes: No function symbol or storage format changes.
Configuration: No new configs added.
Observability: Error logging is preserved (downgraded from cancel to LOG(WARNING) in AutoReleaseClosure, but callbacks handle errors appropriately).
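
The lifecycle point in the checklist above (a closure holding only a weak_ptr while the owner keeps the callback alive through a type-erased shared_ptr<void>) can be sketched like this. All `Sketch*` names are hypothetical; the real classes carry far more state.

```cpp
#include <cassert>
#include <memory>

// Hypothetical callback.
struct SketchCallback {
    int calls = 0;
    void call() { ++calls; }
};

// Like the closure: it holds only a weak reference to the callback.
struct SketchClosure {
    std::weak_ptr<SketchCallback> callback;

    // Returns true if the callback was still alive and was invoked.
    bool run() {
        if (auto cb = callback.lock()) {
            cb->call();
            return true;
        }
        return false;
    }
};

// Like the producer: optionally retains the callback past send().
struct SketchProducer {
    std::shared_ptr<void> keep_alive; // type-erased owner

    SketchClosure send(bool retain) {
        auto cb = std::make_shared<SketchCallback>();
        if (retain) {
            keep_alive = cb;
        }
        return SketchClosure {cb}; // the local shared_ptr dies on return
    }
};

// Usage: without retention the callback expires before run().
inline void check_lifetime() {
    SketchProducer producer;
    assert(!producer.send(false).run());
    assert(producer.send(true).run());
}
```

In the non-retaining case the error path inside `call()` would never execute, which is the situation the `_sync_size_callback` member is described as preventing.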

be/src/util/brpc_closure.h

zclllyybb · 2026-03-28T16:34:37Z

/review

zclllyybb · 2026-03-28T16:34:41Z

run buildall

github-actions

Code Review Summary

PR Goal

Fix a real data race in AutoReleaseClosure where call() was invoked before checking cntl_->Failed() and response_->status(). When call() reuses the callback (as in ExchangeSendCallback via get_send_callback()), it can trigger a new RPC that mutates response_ and cntl_, causing the subsequent reads to see the new RPC's state instead of the original — leading to SIGSEGV crashes.

Core Fix Assessment

The fix is correct and well-motivated:

Reordering in Run(): Log/check cntl/response state before calling callback_->call(). This prevents the data race.
Removing QueryContext from AutoReleaseClosure: Decentralizes error handling to individual callbacks, which is architecturally cleaner.
SyncSizeCallback refactoring: Moves from subclassing AutoReleaseClosure to subclassing DummyBrpcCallback, properly handling errors in call().
Unit tests: Three well-designed tests verify the ordering invariant.

Critical Checkpoint Conclusions

Goal accomplished? Yes, the data race is fixed by reordering call() to be last in Run(). Tests prove the ordering.
Modification minimal and focused? Mostly yes. The QueryContext removal is a secondary cleanup bundled with the fix.
Concurrency? The core race condition is correctly addressed. The callback reuse pattern in ExchangeSendCallback is safe because _send_rpc (which triggers reuse) is documented and enforced as the last operation in success handlers.
Lifecycle management? _sync_size_callback as shared_ptr<void> correctly keeps the callback alive. The weak_ptr in AutoReleaseClosure allows graceful degradation if the callback owner is gone.
Parallel code paths? Both unicast and broadcast paths in exchange_sink_buffer.cpp are updated with matching comments.
Error handling? See inline comment — the ignore_runtime_filter_error=false + RPC failure path that previously cancelled the query now silently degrades. The existing regression test test_inject_send_filter_size_fail.groovy expects the old cancellation behavior and will likely fail.
Test coverage? Good unit tests for the ordering fix. Missing update to the regression test.
Incompatible changes? No storage/protocol changes. The AutoReleaseClosure constructor signature change is source-breaking but all callers are updated.
Performance? No concerns — the reordering adds no overhead.

be/src/exec/runtime_filter/runtime_filter_producer.cpp

be/src/exec/runtime_filter/runtime_filter.cpp

be/src/util/brpc_closure.h

doris-robot · 2026-03-28T17:23:20Z

TPC-H: Total hot run time: 26928 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d29c1f8898175730bb45ddfc4d4440f2f93452b2, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17651	4492	4276	4276
q2	q3	10669	827	540	540
q4	4699	359	252	252
q5	7581	1224	1012	1012
q6	176	174	142	142
q7	802	852	673	673
q8	10031	1490	1338	1338
q9	5473	4732	4733	4732
q10	6323	1926	1628	1628
q11	471	258	240	240
q12	755	584	467	467
q13	18029	2695	1938	1938
q14	221	236	220	220
q15	q16	729	747	672	672
q17	751	839	441	441
q18	5916	5331	5328	5328
q19	1485	974	611	611
q20	540	498	387	387
q21	4513	1849	1702	1702
q22	410	346	329	329
Total cold run time: 97225 ms
Total hot run time: 26928 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4752	4696	4553	4553
q2	q3	3863	4338	3819	3819
q4	861	1199	768	768
q5	4065	4371	4370	4370
q6	212	184	141	141
q7	1748	1661	1570	1570
q8	2598	2722	2612	2612
q9	7582	7373	7474	7373
q10	3714	3959	3603	3603
q11	529	445	420	420
q12	536	626	469	469
q13	2557	3073	2066	2066
q14	309	312	289	289
q15	q16	744	812	880	812
q17	1212	1405	1392	1392
q18	7366	6920	6663	6663
q19	925	886	919	886
q20	2054	2129	1935	1935
q21	3927	3506	3302	3302
q22	452	426	379	379
Total cold run time: 50006 ms
Total hot run time: 47422 ms

doris-robot · 2026-03-28T17:34:05Z

TPC-DS: Total hot run time: 167957 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d29c1f8898175730bb45ddfc4d4440f2f93452b2, data reload: false

query5	4337	671	519	519
query6	327	236	205	205
query7	4215	470	264	264
query8	354	241	226	226
query9	8727	2695	2699	2695
query10	505	401	339	339
query11	7041	5092	4873	4873
query12	181	131	128	128
query13	1286	479	350	350
query14	5753	3733	3430	3430
query14_1	2839	2844	2802	2802
query15	212	193	174	174
query16	1000	467	456	456
query17	886	728	629	629
query18	2448	451	348	348
query19	218	218	196	196
query20	134	125	131	125
query21	216	137	114	114
query22	13201	13431	13205	13205
query23	16166	15796	16178	15796
query23_1	16114	16303	16073	16073
query24	7820	1746	1266	1266
query24_1	1291	1281	1331	1281
query25	567	493	472	472
query26	1561	299	153	153
query27	3460	488	294	294
query28	4497	1844	1820	1820
query29	831	559	469	469
query30	298	221	188	188
query31	1008	939	871	871
query32	83	72	73	72
query33	505	329	288	288
query34	879	851	527	527
query35	662	715	599	599
query36	1042	1126	966	966
query37	140	92	86	86
query38	2910	2872	2866	2866
query39	853	826	808	808
query39_1	800	790	800	790
query40	234	157	136	136
query41	64	64	60	60
query42	258	254	255	254
query43	234	253	221	221
query44	
query45	197	185	184	184
query46	886	1037	600	600
query47	2103	2543	2047	2047
query48	315	318	229	229
query49	637	461	390	390
query50	693	270	211	211
query51	4026	4089	3986	3986
query52	263	262	255	255
query53	288	340	299	299
query54	297	267	256	256
query55	93	88	91	88
query56	315	317	320	317
query57	1876	1672	1715	1672
query58	285	281	267	267
query59	2789	2954	2716	2716
query60	339	338	318	318
query61	158	157	160	157
query62	626	583	539	539
query63	304	272	274	272
query64	5128	1350	1012	1012
query65	
query66	1468	468	358	358
query67	24236	24378	24082	24082
query68	
query69	400	310	287	287
query70	957	936	964	936
query71	324	305	299	299
query72	2876	2771	2487	2487
query73	539	542	316	316
query74	9576	9542	9401	9401
query75	2839	2764	2442	2442
query76	2299	1024	680	680
query77	387	421	313	313
query78	10825	11045	10423	10423
query79	1113	788	564	564
query80	872	637	551	551
query81	526	266	221	221
query82	1327	152	123	123
query83	331	264	245	245
query84	298	113	96	96
query85	976	508	453	453
query86	378	332	294	294
query87	3111	3101	3027	3027
query88	3547	2634	2631	2631
query89	417	372	345	345
query90	1845	177	170	170
query91	167	163	146	146
query92	80	76	72	72
query93	908	851	500	500
query94	532	339	284	284
query95	619	405	327	327
query96	652	510	227	227
query97	2491	2471	2418	2418
query98	245	220	214	214
query99	999	1001	918	918
Total cold run time: 250420 ms
Total hot run time: 167957 ms

hello-stephen · 2026-03-28T19:14:51Z

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	52.90% (19934/37685)
Line Coverage	36.41% (186790/512958)
Region Coverage	32.68% (144878/443363)
Branch Coverage	33.87% (63518/187534)

hello-stephen · 2026-03-28T19:33:20Z

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.60% (27165/36907)
Line Coverage	57.09% (291971/511420)
Region Coverage	54.41% (243480/447485)
Branch Coverage	56.17% (105664/188100)

zclllyybb · 2026-03-29T10:39:23Z

/review

github-actions

Code Review Summary

PR Goal

Fix a data race in AutoReleaseClosure::Run() where call() was invoked before checking cntl_->Failed() and response_->status(). Since call() can reuse the callback (resetting cntl_ and response_), subsequent reads of these fields in Run() would see the NEW RPC's state instead of the original, causing SIGSEGV crashes.

Core Fix Assessment

The fix is correct and well-motivated. Moving callback->call() to be the last operation in AutoReleaseClosure::Run() (after logging) eliminates the data race. The approach of removing QueryContext from AutoReleaseClosure and pushing error handling into individual callbacks is a clean architectural improvement that makes the responsibilities clearer.

Critical Checkpoint Conclusions

1. Does the code accomplish its goal?
Yes. The reordering in AutoReleaseClosure::Run() ensures cntl_ and response_ are read before call() can mutate them. The new tests prove the ordering.

2. Is the modification minimal and focused?
Mostly yes. The core fix (reordering in brpc_closure.h) is minimal. The SyncSizeCallback refactoring is a necessary consequence. However, the removal of QueryContext-based error propagation from the DummyBrpcCallback paths (merge_filter, sync_filter_size coordinator-side, publish_filter) is a behavioral side-effect that goes beyond the race fix -- see inline comment.

3. Concurrency?
Yes, this is fundamentally a concurrency fix. The critical variable is the shared cntl_/response_ state between AutoReleaseClosure and the callback. The fix ensures the closure reads before the callback writes, without requiring locks (sequential ordering within Run() suffices since the race was between the closure's own read and the callback's write-then-reuse pattern).

4. Lifecycle management?
The _sync_size_callback member in RuntimeFilterProducer correctly keeps the SyncSizeCallback alive beyond send_size() return, compensating for AutoReleaseClosure's weak_ptr storage. Good design.

5. Parallel code paths?
WriteBlockCallback (in vtablet_writer.h) follows the same pattern and is safe: its call() method handles errors internally and calls clear_in_flight() at the very end. The ExchangeSendCallback also handles errors internally in call(). Both are compatible with the new ordering.

6. Test coverage?
Good. Three new tests in exchange_sink_test.cpp verify the ordering invariant. Seven tests in sync_size_callback_test.cpp cover success, failure, ignore_runtime_filter_error, and expired weak_ptr scenarios for SyncSizeCallback.

7. Configuration changes? None.

8. Incompatible changes? None (internal refactoring only).

9. Performance? No concerns. The _log_error_status call before callback->call() is lightweight (just a LOG on error).

10. Behavioral change concern: See inline comment on runtime_filter.cpp. The removal of QueryContext from AutoReleaseClosure means that merge_filter, coordinator-side sync_filter_size, and publish_filter RPC failures no longer cancel the query when ignore_runtime_filter_error=false. This is a behavioral change beyond the race fix scope.

11. Stale comment: runtime_filter_wrapper.h:91 still references SyncSizeClosure (now renamed to SyncSizeCallback). Not in the diff, so noting here for follow-up.

Verdict

The core race fix is correct and well-tested. One behavioral concern is flagged inline regarding the loss of query cancellation on runtime filter RPC failures in non-SyncSizeCallback paths.

github-actions · 2026-03-29T10:46:43Z

be/src/exec/runtime_filter/runtime_filter.cpp

-                                          ? std::weak_ptr<QueryContext> {}
-                                          : state->get_query_ctx_weak());
+                    create_unique(merge_filter_request, merge_filter_callback);
    void* data = nullptr;


Behavioral change: Previously, this AutoReleaseClosure received a weak_ptr<QueryContext> (when ignore_runtime_filter_error=false), and _process_if_rpc_failed() / _process_if_meet_error_status() would cancel the query on merge_filter RPC failure.

Now with the QueryContext removed from AutoReleaseClosure, this path only logs a WARNING on failure -- the query is not cancelled. The same applies to the coordinator-side sync_filter_size and publish_filter RPCs in runtime_filter_mgr.cpp.

If this is intentional (runtime filter failures are non-fatal for correctness), it should be documented in the PR description as a deliberate behavioral change. If not, these paths need custom callbacks (like SyncSizeCallback) that respect ignore_runtime_filter_error and cancel the query when appropriate.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Copilot · 2026-03-29T11:06:42Z

be/src/exec/runtime_filter/runtime_filter.cpp

-                                  state->query_options().ignore_runtime_filter_error
-                                          ? std::weak_ptr<QueryContext> {}
-                                          : state->get_query_ctx_weak());
+                    create_unique(merge_filter_request, merge_filter_callback);


This RPC invocation no longer passes a QueryContext into AutoReleaseClosure, so failures/error statuses will only be logged and won’t cancel the query even when ignore_runtime_filter_error is false. Given _push_to_remote() returns Status::OK() immediately and doesn’t join/inspect the callback, this can silently drop runtime-filter propagation failures. Consider reintroducing query-cancel behavior (checked before invoking callback->call() to avoid the reuse race) or handling errors explicitly in the callback.

Suggested change

-                    create_unique(merge_filter_request, merge_filter_callback);
+                    create_unique(merge_filter_request, merge_filter_callback,
+                                  state->get_query_ctx());

Copilot · 2026-03-29T11:06:43Z

be/src/exec/runtime_filter/runtime_filter_producer.h

+        Status status = Status::create(response_->status());
+        if (!status.ok()) {
+            if (auto w = _wrapper.lock()) {
+                w->set_state(RuntimeFilterWrapper::State::DISABLED, status.to_string());
+            }
+            if (auto ctx = _context.lock()) {
+                if (!ctx->ignore_runtime_filter_error()) {
+                    ctx->cancel(Status::NetworkError("RPC meet failed: {}", status.to_string()));
+                }
+            }
+            ((CountedFinishDependency*)_dependency.get())->sub();
+        }


On non-OK response status, the code cancels the query using Status::NetworkError("RPC meet failed: {}", status.to_string()). This misclassifies an application-level error status as a network error and also reuses the “RPC meet failed” wording even though the RPC itself succeeded. Consider cancelling with the actual status (or mapping to a more appropriate error code/message) so error handling/metrics reflect the real failure mode.

Copilot · 2026-03-29T11:06:43Z

be/src/exec/runtime_filter/runtime_filter_producer.h

+    // Should use weak ptr here, because when query context deconstructs, should also delete runtime filter
+    // context, it not the memory is not released. And rpc is in another thread, it will hold rf context
+    // after query context because the rpc is not returned.


Comment grammar is unclear: “context, it not the memory is not released.” This is hard to read and may be misleading for future maintainers.

Suggested change

-    // Should use weak ptr here, because when query context deconstructs, should also delete runtime filter
-    // context, it not the memory is not released. And rpc is in another thread, it will hold rf context
-    // after query context because the rpc is not returned.
+    // Should use a weak ptr here because when the query context is destroyed, the runtime filter
+    // context should also be deleted; otherwise its memory may not be released. Since the RPC runs
+    // in another thread, it may hold the runtime filter context after the query context is destroyed
+    // if the RPC has not yet returned.

Copilot · 2026-03-29T11:06:43Z

be/src/exec/runtime_filter/runtime_filter_mgr.cpp

        auto closure = AutoReleaseClosure<PPublishFilterRequestV2,
                                          DummyBrpcCallback<PPublishFilterResponse>>::
                create_unique(std::make_shared<PPublishFilterRequestV2>(apply_request),
-                              DummyBrpcCallback<PPublishFilterResponse>::create_shared(), ctx);
+                              DummyBrpcCallback<PPublishFilterResponse>::create_shared());


Similar to sync_filter_size, apply_filterv2 RPCs now use AutoReleaseClosure without any query-context cancellation/propagation. If these RPCs fail, the query may continue without required runtime filters (or hang waiting for them depending on downstream logic) with only a warning log. Consider adding explicit error handling (cancel query when ignore_runtime_filter_error is false, or disable the affected filter and unblock dependencies) in a callback, while keeping call() last to avoid callback reuse races.

Copilot · 2026-03-29T11:06:43Z

be/src/util/brpc_closure.h

    void Run() override {
        Defer defer {[&]() { delete this; }};
-        // If lock failed, it means the callback object is deconstructed, then no need
-        // to deal with the callback any more.
-        if (auto tmp = callback_.lock()) {
-            tmp->call();
-        }
+        // shouldn't do heavy work here. all heavy work should be done in callback's call() (which means in success/failure handlers)
        if (cntl_->Failed()) {
-            _process_if_rpc_failed();
+            LOG(WARNING) << "brpc failed: " << cntl_->ErrorText();
        } else {
-            _process_status<ResponseType>(response_.get());
+            _log_error_status<ResponseType>(response_.get());
+        }
+        // this must be the LAST operation in this function, because call() may reuse the callback! (response_ is in callback_)
+        if (auto tmp = callback_.lock()) {
+            tmp->call();
        }


AutoReleaseClosure::Run() now only logs cntl_->Failed() / response status errors, and then invokes callback->call() last (good for avoiding the reuse race). However, this also removes the previous ability to cancel a QueryContext on RPC failure/error status, which several runtime-filter call sites previously depended on. Consider restoring an optional QueryContext/error-handling hook executed before callback->call() so you keep the ordering fix without weakening error propagation semantics across the codebase.
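
One possible shape for such a hook, as a sketch only: the closure inspects the original RPC's state, hands any error to an optional handler, and invokes `call()` last. `HookRpcState`, `HookCallback`, `HookClosure`, and the `on_error` member are all hypothetical and do not exist in Doris; how a real hook would reach the QueryContext is left open.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for the state behind cntl_ and response_.
struct HookRpcState {
    bool failed = false;
    std::string error_text;
};

// Hypothetical reusable callback: call() resets the state for a new RPC.
struct HookCallback {
    std::shared_ptr<HookRpcState> state = std::make_shared<HookRpcState>();
    void call() { *state = HookRpcState {}; }
};

struct HookClosure {
    std::shared_ptr<HookRpcState> state;
    std::weak_ptr<HookCallback> callback;
    // Optional error hook (an assumption of this sketch), e.g. query cancel.
    std::function<void(const std::string&)> on_error;

    void run() {
        // 1. Read the original RPC's result and propagate errors.
        if (state->failed && on_error) {
            on_error(state->error_text);
        }
        // 2. call() stays last because it may reuse the callback.
        if (auto cb = callback.lock()) {
            cb->call();
        }
    }
};

// Usage: the hook sees the original error even though call() resets it.
inline void check_hook() {
    auto cb = std::make_shared<HookCallback>();
    cb->state->failed = true;
    cb->state->error_text = "timeout";
    std::string seen;
    HookClosure closure {cb->state, cb,
                         [&seen](const std::string& e) { seen = e; }};
    closure.run();
    assert(seen == "timeout");
    assert(!cb->state->failed);
}
```

The same effect can be had by putting the error handling inside each callback's `call()`, which is the direction the patch takes; the hook variant only centralizes it.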

Copilot · 2026-03-29T11:06:44Z

be/test/exec/exchange/exchange_sink_test.cpp

+// Test: Response starts OK. call() writes an error into it.
+// With correct ordering (log-before-call): the closure's logging sees OK (no warning),
+// then call() runs and the callback captures the OK status at call time.
+// With WRONG ordering (call-before-log): call() writes error first, then the closure
+// would log the error — a false positive. We verify call() saw OK at invocation time,
+// proving it ran after (or at least not before) the status was checked by the closure.
+TEST_F(ExchangeSinkTest, test_closure_call_sees_original_ok_response) {
+    auto callback = TestCallback::create_shared();
+    // Response starts OK (default).
+    callback->set_mutate_action(TestCallback::MutateAction::WRITE_ERROR);
+
+    auto req = std::make_shared<PTransmitDataParams>();
+    auto* closure = new AutoReleaseClosure<PTransmitDataParams, TestCallback>(req, callback);
+
+    closure->Run(); // self-deletes
+
+    EXPECT_TRUE(callback->call_invoked) << "call() should have been invoked";
+    EXPECT_TRUE(callback->response_status_at_call_time.ok())
+            << "call() must see the original OK response status. "
+               "If it saw an error, the ordering is wrong.";
+    EXPECT_FALSE(callback->cntl_failed_at_call_time);


The new tests don’t actually validate the ordering guarantee they describe. StateCapturingCallback::call() captures cntl_/response_ state before it mutates them, so it will observe the “original” state regardless of whether AutoReleaseClosure::Run() calls call() before or after checking cntl_->Failed() / response_->status(). Consider asserting an observable side-effect of the closure’s pre-call check (e.g., capture/expect the warning log output, or add a hook that records what AutoReleaseClosure observed before invoking call()).

Copilot · 2026-03-29T11:06:44Z

be/src/exec/runtime_filter/runtime_filter_producer.h

+        if (cntl_->Failed()) {
+            if (auto w = _wrapper.lock()) {
+                w->set_state(RuntimeFilterWrapper::State::DISABLED, cntl_->ErrorText());
+            }
+            if (auto ctx = _context.lock()) {
+                if (!ctx->ignore_runtime_filter_error()) {
+                    ctx->cancel(Status::NetworkError("RPC meet failed: {}", cntl_->ErrorText()));
+                }
+            }
+            ((CountedFinishDependency*)_dependency.get())->sub();
+            return;


SyncSizeCallback stores _dependency as std::shared_ptr<Dependency> but then downcasts with a C-style cast to CountedFinishDependency to call sub(). This is undefined behavior if a different Dependency subtype is ever passed, and it also bypasses type safety. Prefer storing/passing std::shared_ptr<CountedFinishDependency> (or at least dynamic_cast/dynamic_pointer_cast with a CHECK) and calling sub() directly.
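
A checked downcast along the lines suggested here might look like the sketch below. The `Sketch*` types are hypothetical stand-ins for `Dependency` and `CountedFinishDependency`, and `checked_sub` is an illustrative helper; a real version would probably pair the null check with a CHECK or an error log.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for Dependency and its subclasses.
struct SketchDependency {
    virtual ~SketchDependency() = default;
};
struct SketchCountedDependency : SketchDependency {
    int pending = 1;
    void sub() { --pending; }
};
struct SketchOtherDependency : SketchDependency {};

// Calls sub() only if the dynamic type really is the counted subclass.
// Returns false instead of invoking undefined behavior on a mismatch.
inline bool checked_sub(const std::shared_ptr<SketchDependency>& dep) {
    auto counted = std::dynamic_pointer_cast<SketchCountedDependency>(dep);
    if (!counted) {
        return false;
    }
    counted->sub();
    return true;
}

// Usage: the matching type is decremented, the mismatch is rejected.
inline void check_cast() {
    auto counted = std::make_shared<SketchCountedDependency>();
    assert(checked_sub(counted));
    assert(counted->pending == 0);
    assert(!checked_sub(std::make_shared<SketchOtherDependency>()));
}
```

Storing a `std::shared_ptr` to the counted type directly would remove the need for any cast.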

zclllyybb requested review from BiteTheDDDDt, Copilot and yiguolei · March 26, 2026 15:51

zclllyybb added the usercase (Important user case type label), dev/4.0.x and dev/4.1.x labels · Mar 26, 2026

Copilot AI reviewed · Mar 26, 2026

Copilot started reviewing on behalf of zclllyybb · March 26, 2026 16:33

github-actions bot reviewed · Mar 26, 2026

be/src/util/brpc_closure.h (resolved)

zclllyybb force-pushed the brpc_closure branch from 3d14801 to d29c1f8 · March 28, 2026 16:32

github-actions bot reviewed · Mar 28, 2026

be/src/exec/runtime_filter/runtime_filter_producer.cpp (outdated, resolved)

be/src/exec/runtime_filter/runtime_filter.cpp (resolved)

be/src/util/brpc_closure.h (resolved)

zclllyybb marked this pull request as ready for review · March 29, 2026 09:37

all · 00e71e9

zclllyybb force-pushed the brpc_closure branch from d29c1f8 to 00e71e9 · March 29, 2026 09:49

beut · 178c813

github-actions bot reviewed · Mar 29, 2026

zclllyybb requested a review from Copilot · March 29, 2026 10:58

Copilot started reviewing on behalf of zclllyybb · March 29, 2026 10:59

Copilot AI reviewed · Mar 29, 2026
